YAHMM vs SKLearn Time Test

Author: Jacob Schreiber (jmschreiber91@gmail.com)

We want to test the speed of the Yet Another Hidden Markov Model (YAHMM) package against a current standard, scikit-learn. Scikit-learn's HMM support is currently deprecated, so comparing feature lists may not be appropriate. However, it performs the underlying math with matrix operations that have been optimized in C, so comparing the speed of the features both packages support is a good test of whether YAHMM is in the same general performance range.


In [1]:
%matplotlib inline
import yahmm
import sklearn
import time
import random
import numpy as np

print "Package Versions"
print "YAHMM Version: {}".format( yahmm.__version__ )
print "SKLEARN Version: {}".format( sklearn.__version__ )

from yahmm import *
from sklearn.hmm import *


Package Versions
YAHMM Version: 1.0.0
SKLEARN Version: 0.14.1

We are comparing YAHMM v1.0.0 to sklearn v0.14.1. These calculations were done on a laptop running 64-bit Windows 7 with a 2.2 GHz processor. Download and rerun this notebook to see if YAHMM is faster on your machine. As a note: these times are approximate and are displayed to far more significant figures than they deserve, so please take the exact numbers with a grain of salt.

Let's first look at the example of a small, dense, two-state HMM which models CG islands in a DNA sequence. DNA sequences are made up of four nucleotides, and CG islands are regions rich in the nucleotides cytosine (C) and guanine (G). It can be biologically useful to identify which regions of the genome are rich in these two nucleotides, and hidden Markov models can decode such a sequence with ease.

Let's first define the number of times we want each of these algorithms to run.


In [2]:
n = 100

The first step is to create a random 100,000-nucleotide sequence with three CG islands placed in it. This is done through a simple weighted choice, with the weights matching the emission distributions of the two states in the model defined later.


In [3]:
# Define the nucleotide alphabet
nuc = ['A', 'C', 'G', 'T']

# Define the CG island and background nucleotide probabilities, with
# indexes matching the nuc list.
CG = [0.10, 0.40, 0.40, 0.10]
null = [0.25, 0.25, 0.25, 0.25 ]

# Generate the sequence in seven segments, four background regions and three CG
# regions.
segments = ( np.random.choice( nuc, 20000, True, null ),
             np.random.choice( nuc, 10000, True, CG ),
             np.random.choice( nuc, 20000, True, null ),
             np.random.choice( nuc, 10000, True, CG ),
             np.random.choice( nuc, 20000, True, null ),
             np.random.choice( nuc, 10000, True, CG ),
             np.random.choice( nuc, 10000, True, null ) )

# Concatenate them into one long sequence.
sequence = np.concatenate( segments )

Now that we have a sequence, and a number of times we want to test each algorithm, we need to build the models in both sklearn and YAHMM. This is done in sklearn using a MultinomialHMM, and done in YAHMM using states with DiscreteDistribution emission distributions. Both represent the same underlying model.


In [4]:
# Build the SKLearn HMM using the matrix representation it requires.
# Create the probability of starting in one of the two states.
startprob = np.array([0.9, 0.1])

# Create the transitions between the two states.
transmat = np.array([[ 0.9, 0.1],
                     [ 0.2, 0.8]] )

# The first state has the null distribution, and the second state has
# the CG island distribution.
emissions = np.array([[ 0.25, 0.25, 0.25, 0.25 ],
                      [ 0.10, 0.40, 0.40, 0.10 ]])

# Create the model object. It does not take the emission matrix as an argument,
# and so that property must be set in the next line.
sklearn_model = MultinomialHMM( n_components=2, 
    transmat=transmat, startprob=startprob )
sklearn_model.emissionprob_ = emissions


# SKLearn HMMs do not actually take characters, they take numbers representing
# indexes. We must convert the sequence into these numbers before doing the
# speed test.
d = {'A':0, 'C':1, 'G':2, 'T':3 }
sklearn_seq = map( d.__getitem__, sequence )

The sequence needs to be preprocessed for the sklearn model, as it does not take characters but rather integers corresponding to each character. This is done before the speed test to make sure it doesn't interfere with our results. Now let's build the YAHMM model.


In [5]:
# Create the YAHMM model object 
yahmm_model = Model( "CG Islands" )

# Create the background state, with the uniform distribution.
background = State( DiscreteDistribution( 
    {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25 } ), name="Background" )

# Create the CG island state, with the CG weighted distribution.
island = State( DiscreteDistribution( 
    {'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10 } ), name="Island" )

# Add both of the states to the model.
yahmm_model.add_states([background, island])

# Add all the transitions from the start to each of the states, and from the
# states to each other.
yahmm_model.add_transition( yahmm_model.start, background, 0.90 )
yahmm_model.add_transition( yahmm_model.start, island, 0.10 )
yahmm_model.add_transition( background, background, 0.90 )
yahmm_model.add_transition( background, island, 0.10 )
yahmm_model.add_transition( island, island, 0.80 )
yahmm_model.add_transition( island, background, 0.20 )

# Bake the model, to solidify the internals.
yahmm_model.bake( verbose=True )

scikit-learn has sequence decoding (Viterbi) and sequence scoring (forward) implemented, so we compare YAHMM's speed on those two algorithms and make sure we get the same log probability for the scoring. We start by comparing the Viterbi decoding time of the two packages.


In [6]:
## SKLEARN ##
# Set the starting time to 0
t = 0.

# Run through this loop n times, to get a better measurement of the mean time
# it takes for the algorithms.
for i in xrange( n ):
    # Get the starting time.
    tic = time.time()

    # Decode the sequence using the viterbi algorithm.
    sklearn_model.decode( sklearn_seq )

    # Add the time it took this loop to the total time.
    t += time.time() - tic

scikit_learn_time = t
# Calculate the logp for reference.
logp, path = sklearn_model.decode( sklearn_seq )

# Print the results!
print "scikit-learn: logp = {}, Total Time = {}, Time Per Loop = {}".format( logp, t, t/n )

## YAHMM ##
# Set the starting time to 0 
t = 0.

# Run through this loop n times, to get a better measurement of the mean time
# it takes for the algorithms.
for i in xrange( n ):
    # Get the starting time
    tic = time.time()

    # Decode the sequence using the viterbi algorithm.
    yahmm_model.viterbi( sequence )

    # Add the time it took for this loop to the total time.
    t += time.time() - tic
yahmm_time = t
# Calculate the logp for reference.
logp, path = yahmm_model.viterbi( sequence )

# Print the results!
print "YAHMM: logp = {}, Total Time = {}, Time Per Loop = {}".format( logp, t, t/n )

# Print how much faster one is than the other
if yahmm_time > scikit_learn_time:
    print "scikit_learn is {}x faster".format( yahmm_time / scikit_learn_time )
else:
    print "yahmm is {}x faster".format( scikit_learn_time / yahmm_time )


scikit-learn: logp = -146577.978708, Total Time = 394.746999979, Time Per Loop = 3.94746999979
YAHMM: logp = -146577.978708, Total Time = 33.7730000019, Time Per Loop = 0.337730000019
yahmm is 11.6882420856x faster
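
Both packages recover the same log probability for the Viterbi path. As a quick illustration of the decoding itself (a sketch, assuming YAHMM's viterbi returns the log probability together with the path as a list of (index, state) tuples), the path can be turned back into CG island labels:

# Recover the state labels from the Viterbi path. The returned path also
# contains the silent start state, so filter on the state name.
logp, path = yahmm_model.viterbi( sequence )
names = [ state.name for index, state in path ]
print "positions decoded as CG island: {}".format( names.count( "Island" ) )

Roughly 30,000 of the 100,000 positions should be labeled as island, reflecting the three 10,000-nucleotide CG regions generated above.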

YAHMM is significantly faster at the Viterbi algorithm than scikit-learn; it is probably safe to say YAHMM is around an order of magnitude faster. Now let's compare the forward algorithm time of the two packages.


In [7]:
## SKLEARN ##
# Set the starting time to 0
t = 0.

# Run through this loop n times, to get a better measurement of the mean time
# it takes for the algorithms.
for i in xrange( n ):
    # Get the starting time.
    tic = time.time()

    # Score the sequence using the forward algorithm.
    sklearn_model.score( sklearn_seq )

    # Add the time it took for this loop to the total time.
    t += time.time() - tic
scikit_learn_time = t

# Calculate the logp for reference.
logp = sklearn_model.score( sklearn_seq )

# Print the results!
print "scikit-learn: logp = {}, Total Time = {}, Time Per Loop = {}".format( logp, t, t/n )

## YAHMM ##
# Set the starting time to 0
t = 0.

# Run through this loop n times, to get a better measurement of the mean time
# it takes for the algorithms.
for i in xrange( n ):
    # Get the starting time.
    tic = time.time()

    # Score the sequence using the forward algorithm (model.log_probability is
    # a wrapper for the forward algorithm which returns only the final log
    # probability instead of the full matrix).
    yahmm_model.log_probability( sequence )

    # Add the time it took for this loop to the total time.
    t += time.time() - tic
yahmm_time = t
    
# Calculate the logp for reference.
logp = yahmm_model.log_probability( sequence )

# Print the results!
print "YAHMM: logp = {}, Total Time = {}, Time Per Loop = {} ".format( logp, t, t/n )

# Print how much faster one is than the other
if yahmm_time > scikit_learn_time:
    print "scikit_learn is {}x faster".format( yahmm_time / scikit_learn_time )
else:
    print "yahmm is {}x faster".format( scikit_learn_time / yahmm_time )


scikit-learn: logp = -136213.410941, Total Time = 78.9299998283, Time Per Loop = 0.789299998283
YAHMM: logp = -136213.410941, Total Time = 23.4089999199, Time Per Loop = 0.234089999199 
yahmm is 3.37178008879x faster
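
The forward log probabilities also agree. For completeness, here is a minimal sanity check (a sketch assuming the models and sequences defined above are still in scope) that asserts the agreement programmatically rather than by eye:

# The two scores come from different floating point implementations, so
# compare them with a tolerance rather than with ==.
sklearn_logp = sklearn_model.score( sklearn_seq )
yahmm_logp = yahmm_model.log_probability( sequence )
assert np.allclose( sklearn_logp, yahmm_logp )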

In this case YAHMM is still faster, though not by as large a margin as with Viterbi decoding. It is probably safe to say it is approximately three times faster.

Running these speed tests shows that YAHMM is not only in the ballpark when performing hidden Markov model calculations, but is actually several times faster than scikit-learn. Again, the scikit-learn HMM implementation is currently deprecated, so this is not a reflection on the scikit-learn team; it simply shows that YAHMM is indeed a speedy implementation of hidden Markov models.

Now let's look at the example of a large sparse HMM instead of a small dense one. YAHMM was designed to perform well on large sparse HMMs, both by storing the transition matrix in a sparse format and by calculating probabilities only across edges which exist, instead of performing the full matrix calculation.
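
To make the dense-versus-sparse distinction concrete, here is a toy sketch of the two styles of forward recursion. It works in plain probability space with no scaling (so it would underflow on long sequences) and is not YAHMM's actual Cython implementation; it only illustrates why touching the full n x n transition matrix costs O(n^2) per symbol, while iterating over an edge list costs O(|edges|) per symbol.

def forward_dense( startprob, transmat, emissions, seq ):
    # Dense recursion: every step multiplies by the full n x n transition
    # matrix, whether or not most of its entries are zero.
    f = startprob * emissions[:, seq[0]]
    for symbol in seq[1:]:
        f = f.dot( transmat ) * emissions[:, symbol]
    return np.log( f.sum() )

def forward_sparse( startprob, edges, emissions, seq ):
    # Sparse recursion: 'edges' is a list of (i, j, p) transitions, so each
    # step only visits edges that actually exist in the model.
    n = len( startprob )
    f = startprob * emissions[:, seq[0]]
    for symbol in seq[1:]:
        f_new = np.zeros( n )
        for i, j, p in edges:
            f_new[j] += f[i] * p
        f = f_new * emissions[:, symbol]
    return np.log( f.sum() )

For the 500-state profile built below, only about a thousand of the 250,000 possible transitions exist, which is where the sparse representation pays off.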

Let's define two helper functions, one for sklearn and one for YAHMM, each of which takes in a profile as a list of dictionaries, with each dictionary representing the discrete distribution at that position in the profile.


In [8]:
def sklearn_profile( profile ):
    '''
    This function will take in a profile as a list of dictionaries, where each
    dictionary represents a discrete distribution. It will then create a
    multinomial HMM from that in the sklearn format.
    '''

    # Store the length of the profile.
    n = len( profile )

    # Make it only start at the first hidden state.
    startprob = np.zeros(n)
    startprob[0] = 1.0

    # Initialize the transitions as all 0s.
    transition = np.zeros( (n,n) )

    # Iterate through each state in the profile
    for i in xrange( n ):
        # Set the probability of a self loop to 0.10
        transition[i, i] = 0.1

        # If this is the end state, we don't have a next state to transition
        # to, and so just set the self transition to 1. 
        if i == n - 1:
            transition[i, i] = 1

        # Otherwise, set the probability of transitioning to the next state to
        # 0.90.
        else:
            transition[i, i+1] = 0.9

    # Unpack the distributions into arrays where the indexes correspond to the
    # appropriate nucleotide.
    emissions = np.array( [[col['A'], col['C'], 
        col['G'], col['T']] for col in profile] )

    # Create the HMM object. 
    hmm = MultinomialHMM( n_components=n, transmat=transition, startprob=startprob )
    hmm.emissionprob_ = emissions
    return hmm

In [9]:
def yahmm_profile( profile ):
    '''
    This function will take in a profile as a list of dictionaries, where each
    dictionary represents a discrete distribution. It will then create a
    multinomial HMM from that in YAHMM format.
    '''

    # Store the length of the profile.
    n = len( profile )

    # Turn the dictionaries into DiscreteDistribution objects
    distributions = map( DiscreteDistribution, profile )

    # Create the model object 
    model = Model( "Profile" )

    # In order to connect the model, we're going to iterate through the states
    # with a pointer to the current state, and a pointer to the previous state,
    # in the same way you would go over a linked list.

    # Start off with the start state.
    last_state = model.start

    # Iterate through the distributions
    for i, distribution in enumerate( distributions ):
        # Create a state object, and store it in the model.
        state = State( distribution, name=str(i) )
        model.add_state( state )


        # Add the appropriate transitions: the start state transitions to the
        # first state with probability 1, every other state is entered from
        # the previous state with probability 0.9, each state loops back to
        # itself with probability 0.1, and the final state's self loop is 1.
        if i == n - 1:
            model.add_transition( state, state, 1. )
        else:
            model.add_transition( state, state, 0.1 )

        if i == 0:
            model.add_transition( last_state, state, 1. )
        else:
            model.add_transition( last_state, state, 0.9 )

        last_state = state

    # Bake the structure to finalize the internals of the model.
    model.bake()
    return model

Let's create the profile and pass it into the two functions above to get two models. We then sample the YAHMM model to generate a test sequence, which is preprocessed into the integer format that the sklearn HMM can read.


In [10]:
np.random.seed(0)
random.seed(0)

# Set the length of the profile
m = 500

# Initialize the profile with probability 0.01 for each of the nucleotides
# to begin with. 
profile = [{ 'A': 0.01, 'C': 0.01, 'G': 0.01, 'T': 0.01 } for i in xrange(m)]

# For each column, choose one nucleotide which should be mostly represented
# at this position, and set that probability to 0.97 while the rest remain
# 0.01, allowing for slight deviations from the profile.
for i in xrange( m ):
    nucleotide = random.choice(['A', 'C', 'G', 'T'])
    profile[i][nucleotide] = 0.97

# Create the sklearn model using this profile.
sklearn_hmm = sklearn_profile( profile )

# Create the YAHMM model using this profile. 
yahmm_hmm = yahmm_profile( profile )

# Randomly generate a sequence using the YAHMM model.
sequence = yahmm_hmm.sample( 2*n )

# Convert the sequence from nucleotide characters to numerical indexes for
# scikit-learn to use later.
d = {'A':0, 'C':1, 'G':2, 'T':3 }
sklearn_seq = map( d.__getitem__, sequence )

Now let's run the same time tests as before, first Viterbi and then forward.


In [11]:
### VITERBI ALGORITHM TIME TEST ###
## SCIKIT-LEARN ## 
# Set the running time to 0.
t = 0

# Perform the time test n times.
for i in xrange( n ):
    # Get the starting time
    tic = time.time()

    # Decode the sequence using the viterbi algorithm.
    sklearn_hmm.decode( sklearn_seq )

    # Add the time it took for this loop to the total time.
    t += time.time() - tic
scikit_learn_time = t
    
# Calculate logp for reference.
logp, path = sklearn_hmm.decode( sklearn_seq )

# Print results.
print "SKLEARN HMM: logp {}, Total Time: {}, Time Per Loop: {}".format( logp, t, t/n )

## YAHMM ## 
# Set the running time to 0.
t = 0

# Perform the time test n times.
for i in xrange( n ):
    # Get the starting time.
    tic = time.time()

    # Decode the sequence using the viterbi algorithm.
    yahmm_hmm.viterbi( sequence )

    # Add the time it took for this loop to the total time.
    t += time.time() - tic
yahmm_time = t
    
# Calculate logp for reference.
logp, path = yahmm_hmm.viterbi( sequence )

# Print results
print "YAHMM HMM: logp {}, Total Time: {}, Time Per Loop: {}".format( logp, t, t/n )

# Print how much faster one is than the other
if yahmm_time > scikit_learn_time:
    print "scikit_learn is {}x faster".format( yahmm_time / scikit_learn_time )
else:
    print "yahmm is {}x faster".format( scikit_learn_time / yahmm_time )


SKLEARN HMM: logp -87.2849568201, Total Time: 42.8339998722, Time Per Loop: 0.428339998722
YAHMM HMM: logp -87.2849568201, Total Time: 4.91499996185, Time Per Loop: 0.0491499996185
yahmm is 8.71495426341x faster

In [12]:
### FORWARD ALGORITHM TIME TEST ###
## SCIKIT-LEARN ##
# Set the running time to 0.
t = 0

# Perform the time test n times.
for i in xrange( n ):
    # Get the starting time.
    tic = time.time()

    # Score the sequence using the forward algorithm.
    sklearn_hmm.score( sklearn_seq )

    # Add the time it took for this loop to the total time.
    t += time.time() - tic
scikit_learn_time = t
    
# Calculate the logp for reference.
logp = sklearn_hmm.score( sklearn_seq )

# Print results
print "SKLEARN HMM: logp {}, Total Time: {}, Time Per Loop: {}".format( logp, t, t/n )

## YAHMM ##
# Set the running time to 0.
t = 0

# Perform the time test n times.
for i in xrange( n ):
    # Get the starting time.
    tic = time.time()

    # Score the sequence using the forward algorithm.
    yahmm_hmm.log_probability( sequence )

    # Add the time it took for this loop to the total time.
    t += time.time() - tic
yahmm_time = t
    
# Calculate the logp for reference.
logp = yahmm_hmm.log_probability( sequence )

# Print results.
print "YAHMM HMM: logp {}, Total Time: {}, Time Per Loop: {}".format( logp, t, t/n )
    
# Print how much faster one is than the other
if yahmm_time > scikit_learn_time:
    print "scikit_learn is {}x faster".format( yahmm_time / scikit_learn_time )
else:
    print "yahmm is {}x faster".format( scikit_learn_time / yahmm_time )


SKLEARN HMM: logp -78.1659697372, Total Time: 359.721999884, Time Per Loop: 3.59721999884
YAHMM HMM: logp -78.1659697372, Total Time: 15.1229996681, Time Per Loop: 0.151229996681
yahmm is 23.7864185531x faster

It looks like YAHMM is significantly faster than scikit-learn in both contexts. An interesting observation is that the speed ratio for the Viterbi algorithm favors YAHMM more in the small dense example than in the large sparse example YAHMM was built for. This is more than made up for by the forward (sum-of-all-paths) scoring, though.

One concern worth addressing is that YAHMM was compared against the MultinomialHMM class in this notebook; the GaussianHMM or GMMHMM models may be faster. A more thorough benchmark would compare YAHMM against all of these models.
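
As a rough sketch of what such a comparison might look like (hedged: the GaussianHMM attribute names follow the sklearn 0.14 example code, NormalDistribution is YAHMM's univariate Gaussian, and the two-state model and its parameters here are made up purely for illustration), the setup would mirror the cells above:

# Two-state Gaussian HMM in sklearn; means_ and covars_ are set after
# construction, just as emissionprob_ was for the MultinomialHMM above.
startprob = np.array([ 0.5, 0.5 ])
transmat = np.array([[ 0.9, 0.1 ],
                     [ 0.1, 0.9 ]])

gaussian_hmm = GaussianHMM( n_components=2, covariance_type="diag",
    startprob=startprob, transmat=transmat )
gaussian_hmm.means_ = np.array([[ 0. ], [ 5. ]])
gaussian_hmm.covars_ = np.array([[ 1. ], [ 1. ]])

# The same model in YAHMM, using NormalDistribution emission states.
gaussian_model = Model( "Gaussian" )
a = State( NormalDistribution( 0., 1. ), name="a" )
b = State( NormalDistribution( 5., 1. ), name="b" )
gaussian_model.add_states([ a, b ])
gaussian_model.add_transition( gaussian_model.start, a, 0.5 )
gaussian_model.add_transition( gaussian_model.start, b, 0.5 )
gaussian_model.add_transition( a, a, 0.9 )
gaussian_model.add_transition( a, b, 0.1 )
gaussian_model.add_transition( b, b, 0.9 )
gaussian_model.add_transition( b, a, 0.1 )
gaussian_model.bake()

# Score a random observation sequence with both models: score expects an
# (n_samples, 1) array, log_probability a flat sequence, exactly as above.
obs = np.random.randn( 10000, 1 )
print gaussian_hmm.score( obs )
print gaussian_model.log_probability( obs[:, 0] )

The same timing loops as above could then be wrapped around the score and log_probability calls.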

YAHMM Information

YAHMM is freely available under the MIT license at https://github.com/jmschrei/yahmm. Feel free to contribute or comment on the library!

Installing YAHMM is easy: simply call pip install yahmm. The dependencies are numpy, scipy, matplotlib, networkx, and Cython. If you do not have these and pip fails when building the prerequisites, the Anaconda distribution is a good way to get many scientific Python packages, and it has been shown to allow YAHMM to install on multiple platforms.

If you have questions or comments, my email is jmschreiber91@gmail.com.